A Semantic Relatedness Measure Based on Combined Encyclopedic, 
Ontological and Collocational Knowledge 
Abstract

We describe a new semantic relatedness mea- 
sure combining the Wikipedia-based Explicit 
Semantic Analysis measure, the WordNet path 
measure and the mixed collocation index. Our 
measure achieves the currently highest results
on the WS-353 test: a Spearman ρ coefficient
of 0.79 (vs. 0.75 in (Gabrilovich and
Markovitch, 2007)) when applying the measure
directly, and a value of 0.87 (vs. 0.78 in
(Agirre et al., 2009)) when using the prediction
of a polynomial SVM classifier trained on
our measure.

In the appendix we discuss the adaptation of 
ESA to 2011 Wikipedia data, as well as vari- 
ous unsuccessful attempts to enhance ESA by 
filtering at word, sentence, and section level. 

1 Introduction 

1.1 Semantic Relatedness and Corpora 

Semantic relatedness describes the degree to which 
concepts are associated via any kind of semantic rela- 
tionship (Scriver, 2006). Its evaluation is a fundamen- 
tal NLP problem, with applications in word-sense dis- 
ambiguation, text classification, information retrieval, 
automatic summarization and many other fields. In re- 
cent decades, a great variety of relatedness measures 
have been defined, based on corpora such as Wikipedia, 
Wiktionary, WordNet, etc. 

Wikipedia is one of the most successful collaborative
projects of all time. Through a constantly growing number
of additions, corrections and verifications, its content
grows in both quantity and quality, and is considered by
many linguists as the corpus they had always dreamed
of (Medelyan et al., 2008).

By measuring the normalized tfidf values of words 
in a page, we can consider the page to be a weighted 
vector in the space of words. Inverting the matrix of 
these vectors we obtain weighted vectors of words in 
the space of pages. As every page deals with a single 
topic, we consider these vectors as being concept vec- 
tors. The ESA (Explicit Semantic Analysis) measure 
between two words is obtained by taking the cosine 
of their concept vectors (Gabrilovich and Markovitch, 
2007). 
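The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this paper: the toy pages, the idf formula log(N/df) and the L2 normalization per page are assumptions made for the example, and all function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(pages):
    """Map each page to a normalized tf-idf vector over words.

    Assumption: idf = log(N / df) and L2 normalization per page.
    """
    n = len(pages)
    df = Counter(w for tokens in pages.values() for w in set(tokens))
    vectors = {}
    for title, tokens in pages.items():
        tf = Counter(tokens)
        vec = {w: c * math.log(n / df[w]) for w, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        vectors[title] = {w: v / norm for w, v in vec.items()} if norm else {}
    return vectors

def invert(page_vectors):
    """Turn page -> word weights into word -> page weights (concept vectors)."""
    concept = {}
    for title, vec in page_vectors.items():
        for w, v in vec.items():
            concept.setdefault(w, {})[title] = v
    return concept

def cosine(u, v):
    """Cosine of two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def esa(w1, w2, concept_vectors):
    """ESA relatedness: cosine of the two words' concept vectors."""
    return cosine(concept_vectors.get(w1, {}), concept_vectors.get(w2, {}))

# Hypothetical toy "Wikipedia" of three already-tokenized pages.
pages = {
    "Tiger": ["tiger", "feline", "stripe", "jungle"],
    "Cat": ["cat", "feline", "pet"],
    "Bank": ["bank", "money", "cash"],
}
concepts = invert(tfidf_vectors(pages))
print(esa("tiger", "feline", concepts))
print(esa("tiger", "money", concepts))
```

Words that never share a page get a relatedness of 0, which is the behavior discussed later for pairs whose relation is ontological rather than encyclopedic.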



Unlike Wikipedia, WordNet (Miller, 1995), a semi- 
formal lexical ontology (Huang et al., 2010), has a 
fine and carefully-crafted ontological structure: word 
senses are represented by sets of synonyms ("synsets"), 
and there is a graph structure on synsets based on hy- 
pernymic relations. Several WordNet-based semantic 
relatedness measures have been defined, based on dis- 
tances in the hypernymic graph, and often combined 
with word distribution in sense-tagged corpora. 

1.2 Evaluation of Results, WS-353 Test 

(Finkelstein et al., 2001) introduce WS-353, a semantic
relatedness test set consisting of 353 word pairs 1 and a
gold standard defined as the mean value of evaluations
by up to 17 human judges. Although this test suite contains
some quite controversial word pairs, 2 it has been
widely used in the literature and has become the de facto
standard for semantic relatedness measure evaluation.
Technically, the final result of the test is the Spearman
ρ rank correlation coefficient (Spearman, 1904)
between the relatedness ranking of pairs by human
judges and that by the tested algorithm. In other words,
only the rank of each pair counts, not the value obtained
for it.
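To make the rank-only nature of the test concrete, here is a small self-contained sketch of Spearman's ρ computed as the Pearson correlation of (tie-averaged) ranks. The four gold and measure values are invented for illustration and are not WS-353 data.

```python
import math

def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented example: human scores vs. an algorithm's scores for four pairs.
gold = [7.35, 8.0, 1.2, 5.0]
measure = [0.3, 0.9, 0.2, 0.01]
rho = spearman(gold, measure)

# Any order-preserving rescaling of the measure leaves rho unchanged.
rho_rescaled = spearman(gold, [v ** 3 for v in measure])
print(rho, rho_rescaled)
```

Because only ranks matter, a measure can be rescaled by any increasing function without affecting its WS-353 score.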

1.3 Our Approach 

By closely examining word pairs that failed to be 
ranked correctly by ESA, we came to the conclusion 
that the WS-353 word pairs belong (non-exclusively) to 
four classes, corresponding to different kinds of seman- 
tic relatedness and requiring different kinds of knowl- 
edge: 

1. encyclopedic: see Section 2;

2. ontological: see Section 3; 

3. collocational: see Section 4; 

4. pragmatic: see Section 6. 

In this paper, we define a new semantic relatedness
measure by combining knowledge related to these four
classes.

1 Actually 352 pairs, since "money / cash" appears twice.
2 For example: "Arafat / terror" (0.765), "Arafat / peace"
(0.673), "Jerusalem / Israel" (0.846), "Jerusalem / Palestinian"
(0.765), etc.



2 Encyclopedic Knowledge 

This class contains pairs that are best sorted by ESA. 
We note that (Agirre et al., 2009) qualify ESA as a dis- 
tributional approach. Indeed, technically two words are 
semantically related in ESA if they appear together fre- 
quently in Wikipedia pages. But since pages are de- 
scriptions of topics (= concepts), words are ESA-close 
when they appear frequently in common concept de- 
scriptions, and therefore in common semantic domains. 
Hence, ESA is semantically richer than a merely distri- 
butional approach. 

ESA is the first and most important component
of our combined relatedness measure. By adapting
our implementation of the ESA algorithm to 2011
Wikipedia data (see App. A), we obtain a Spearman
ρ = 0.7394. In the following sections we describe the
components added to ESA in order to optimize its
performance even further.

3 Ontological knowledge 

To get a better insight into the shortcomings of ESA on
WS-353, we calculate Spearman ρ for the WS-353 set
minus a single pair, for every pair. In Fig. 1 one can
see the top 40 "most problematic" pairs: those whose
removal increases ρ the most. By taking a closer look
at them we can get hints for further improvements of
the measure.

Figure 1: Spearman ρ when removing a single pair from
WS-353.
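The leave-one-out analysis behind Fig. 1 can be sketched as follows. This is an illustrative reconstruction under simplifying assumptions: the helper names are hypothetical, the five scores are invented, and ties are not handled (the classic formula 1 - 6 Σd² / (n(n² - 1)) is used).

```python
def ranks(values):
    """Ranks 1..n; assumes all values are distinct (no tie handling)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for position, i in enumerate(order, start=1):
        r[i] = position
    return r

def spearman_no_ties(x, y):
    """Spearman rho via the sum of squared rank differences."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

def leave_one_out_gain(gold, scores):
    """For each pair index, the change in rho when that pair is removed."""
    base = spearman_no_ties(gold, scores)
    gains = []
    for i in range(len(gold)):
        g = gold[:i] + gold[i + 1:]
        s = scores[:i] + scores[i + 1:]
        gains.append(spearman_no_ties(g, s) - base)
    return gains

# Invented data: pair 0 is rated high by humans but low by the measure.
gold = [9.0, 8.0, 7.0, 2.0, 1.0]
scores = [0.1, 0.8, 0.7, 0.2, 0.05]
gains = leave_one_out_gain(gold, scores)
most_problematic = gains.index(max(gains))
print(most_problematic, gains)
```

Sorting the pairs by decreasing gain gives the kind of "most problematic" list examined in the following paragraphs.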

First of all, we see pairs having a relation that is on- 
tological in nature: "tiger / feline" (hyponym), "mile / 
kilometer" (coordinate terms, or "classmates" (Kuroda 
et al., 2010b)), "dollar / buck" (synonyms), etc. These 
relations are strong enough to justify the presence of 
the pairs in the test set, but do not necessarily imply 
high frequency of terms in common Wikipedia pages. 

A good place for information of an ontological nature
is WordNet. Several WordNet-based measures have been
defined in the literature. When applying
them 3 to the WS-353 test set we get the following ρ values:

3 In fact, these measures apply to synsets rather than to 
words. To avoid going through a sense-disambiguation pro- 
cess, we take the optimistic approach of using for each pair 



| Measure                            | ρ      |
| WNP (Path-based)                   | 0.2873 |
| WUP (Wu and Palmer, 1994)          | 0.1356 |
| RES (Resnik, 1995)                 | 0.2112 |
| JCN (Jiang and Conrath, 1997)      | 0.3172 |
| LCH (Leacock and Chodorow, 1998)   | 0.1437 |
| HSO (Hirst and St-Onge, 1998)      | 0.1598 |
| LIN (Lin, 1998)                    | 0.1987 |
| LESK (Banerjee and Pedersen, 2002) | 0.1304 |



Despite the fact that JCN (which combines
WordNet-graph calculations and word frequencies
from a corpus 4) rates best when used alone, the measure
which we are going to use is WNP, which gives
the best results when combined with ESA (see below).
This measure is based exclusively on the shortest-path 
distance in WordNet and hence is purely ontological. 
For example, the WNP-measure of "wood / forest" is 1 
(synonyms), "bird / cock" is 0.5 (hypernym), "century / 
year" is 0.33, "bishop / rabbi" is 0.25, etc. 
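The example values (1, 0.5, 0.33, 0.25) are consistent with a path measure of the form 1 / (1 + d), where d is the number of edges on the shortest path between two synsets, combined with a maximum over all sense pairs of the two words. The sketch below illustrates this on a hand-made graph; the synset names and edges are hypothetical toy data, not real WordNet content, and the formula is our reading of the examples rather than a quotation of the library's code.

```python
from collections import deque
from itertools import product

# Hypothetical toy fragment of a hypernym graph (NOT real WordNet data).
WORD_TO_SYNSETS = {
    "wood": ["forest.n", "wood.n"],
    "forest": ["forest.n"],
    "bird": ["bird.n"],
    "cock": ["cock.n"],
    "century": ["century.n"],
    "year": ["year.n"],
}
HYPERNYM_EDGES = [
    ("cock.n", "bird.n"),
    ("century.n", "time_period.n"),
    ("year.n", "time_period.n"),
]

def build_graph(edges):
    """Undirected adjacency sets built from hypernym links."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def shortest_path_edges(graph, start, goal):
    """Breadth-first search; returns the edge count or None if unreachable."""
    if start == goal:
        return 0
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def wnp(w1, w2, graph):
    """Word-level path measure: best 1 / (1 + d) over all sense pairs."""
    best = 0.0
    senses = product(WORD_TO_SYNSETS.get(w1, []), WORD_TO_SYNSETS.get(w2, []))
    for s1, s2 in senses:
        d = shortest_path_edges(graph, s1, s2)
        if d is not None:
            best = max(best, 1.0 / (1 + d))
    return best

GRAPH = build_graph(HYPERNYM_EDGES)
print(wnp("wood", "forest", GRAPH), wnp("bird", "cock", GRAPH))
```

Taking the maximum over sense pairs is the "optimistic" choice described in footnote 3: it avoids word-sense disambiguation at the price of always assuming the closest reading.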

We found that this measure performs poorly in
its lower range (since the path length between distant
nodes strongly depends on the density of WordNet for
each knowledge domain). To understand the behavior
of the ESA and WNP measures in their low ranges, we
progressively remove pairs from WS-353 in order of
increasing relatedness.

Figure 2: The effect on Spearman ρ of the progressive
removal of pairs in order of increasing relatedness, for
ESA and WNP measures.

As we can see in Fig. 2, removing pairs in the small-value
range of the measure strongly decreases ρ for ESA
(which, after half of the pairs are removed, becomes
chaotic), while the same operation steadily increases
ρ for WNP. In other words, small-value pairs are crucial
positive contributors for ESA, but rather negative contributors
for WNP. For this reason, we use only the upper
range of WNP, and ignore its results for low-valued
pairs. To achieve a smooth "fade-out" of WNP's lower
range we multiply it by a sigmoid logistic function. We



of words, the pair of senses which are the most closely related.
Hence, if μ is a synset-measure, s, s' are synsets
and w, w' words, we define the induced word-measure μ as
$\mu(w, w') := \max_{s \ni w,\, s' \ni w'} \mu(s, s')$.

4 For the distributional part of Jiang & Conrath, Resnik
and Lin, we use the Wikipedia 2011 corpus.



hence define a new measure 



$$\mu_{\mathrm{EW}}(w_1, w_2) = \mu_{\mathrm{ESA}}(w_1, w_2) \cdot \bigl(1 + \lambda\,\sigma_{m,s}(\mu_{\mathrm{WNP}}(w_1, w_2))\bigr), \qquad (1)$$



where λ weights WNP with respect to ESA, m is the
sigmoid inflection point (= a soft boundary of WNP's
lower range), s is the steepness of the sigmoid (small s
makes the central part of the sigmoid closer to a vertical
line), and "EW" stands for "ESA and WordNet."
Calculations give the following optimal result:

| λ = 4.665, m = 0.26, s = 0.05 | ρ = 0.7779 |



which surpasses the (Gabrilovich and Markovitch,
2007) ESA result of 0.75 by 3.7% (and our own ESA
baseline of 0.7394 by 5.2%). The parameter values
have been obtained by gradient descent. In the next
section we will further enhance this result by taking
collocations into account.
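As an illustration of Eq. (1), the following sketch evaluates the EW measure for given ESA and WNP values. The exact form of the sigmoid is not spelled out in the text; the logistic curve 1 / (1 + exp(-(x - m) / s)) used here is an assumption that matches the description (inflection point m, smaller s giving a steeper step). The default parameter values are the optimal ones reported above; the function names are hypothetical.

```python
import math

def sigmoid(x, m, s):
    """Logistic curve with inflection point m (assumed form).

    Smaller s makes the transition around m steeper.
    """
    return 1.0 / (1.0 + math.exp(-(x - m) / s))

def mu_ew(esa, wnp, lam=4.665, m=0.26, s=0.05):
    """EW measure of Eq. (1): ESA boosted by the upper range of WNP."""
    return esa * (1.0 + lam * sigmoid(wnp, m, s))

# The same ESA value with a low, borderline and high WNP value.
low = mu_ew(0.2, 0.0)
mid = mu_ew(0.2, 0.26)
high = mu_ew(0.2, 1.0)
print(low, mid, high)
```

With these parameters a WNP value well below 0.26 leaves the ESA value almost unchanged, while a WNP value of 1 multiplies it by roughly 1 + λ.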

4 Collocational Knowledge 

Returning to Fig. 1, we see that many "problematic"
pairs are in fact collocations: "baseball / season,"
"money / laundering," "hundred / percent," etc. We
claim that the collocational nature of these word pairs
has motivated their inclusion in WS-353. To show
this, we calculated the collocation index (defined as
$\frac{2\,\#(w_1 w_2)}{\#(w_1) + \#(w_2)}$) of all WS-353 pairs. 5

Figure 3: Top twenty direct (in gray) and inverse (in red)
collocation indices for WS-353.
collocation indices for WS-353. 

The primary goal of WS-353 is to evaluate relatedness
measures, and these are symmetric by definition
(we always have $\mu(w_1, w_2) = \mu(w_2, w_1)$). If
the word pairs were chosen on strictly semantic criteria,
and if collocations were purely accidental, then we
would have a roughly equal number of pairs $(w_1, w_2)$
where $w_1 w_2$ is a collocation and pairs where $w_2 w_1$ is a
collocation.

Fig. 3 shows that this is not the case: for the word 
pairs concerned, WS-353 developers have almost sys- 
tematically chosen to write the words in the order in 
which they form a collocation. 

5 We obtained WS-353 pair and word frequencies from
the 53.45 billion-word GoogleBooks corpus (Michel et al.,
2011). We considered only books published after 1970.



But neither ESA nor WNP recognizes collocations:
the former because of the bag-of-words principle un- 
derlying tfidf, and the latter only in the case where the 
collocational pair is a concept on its own. Indeed, most 
of the collocations in Fig. 3 are WordNet concepts (the 
exceptions being: "gender / equality," "food / prepa- 
ration," "secret / weapon," "energy / crisis," etc.) but 
knowledge of that fact is not sufficient for ranking, 
since there is no mention in WordNet of the strength 
of the collocational relation. 

We use the collocation index to further enhance our 
EW relatedness measure. 
Figure 4: Collocation index vs. Spearman stability of
EW. The red line is LOWESS polynomial regression
(Cleveland, 1981).

Note that this index is not a measure (for example, 
the collocation index of "tiger / tiger" is not 1) and can- 
not be used directly as such. 

How do collocational pairs contribute to the WS-353
Spearman ρ value? In Fig. 4 one can compare collocation
index and Spearman stability (that is, the effect
on ρ of the removal of a single word pair). Pairs located
on the green vertical line are those whose removal does
not affect Spearman ρ. Those on the right increase ρ
when removed. We observe that most collocations are
on the right; in other words, they are negative contributors.
The most problematic ones are collocations which
are not individual WordNet concepts (typical examples:
"school / center," "hotel / reservation," "canyon / landscape,"
etc.).

On the other hand, on the left side we find collocations
that contribute positively to ρ: in many cases
these have a strong ontological relation ("tiger / tiger,"
"street / avenue," "football / soccer," etc.) which
is probably the main reason for their positive contribution.
The LOWESS polynomial regression line is
quasi-horizontal, so we cannot infer whether or not
collocation index is correlated with ρ.

An auxiliary question is whether collocation index
values (at least in the high range) are correlated with the
actual values of the WS-353 gold standard. Fig. 5 compares
these two quantities. As we can see, the LOWESS
polynomial regression is almost monotonically
increasing, which shows that, although not a measure
per se, (high-range) collocation index could be useful
for relatedness measurement.

Figure 5: Collocation index vs. WS-353 gold standard.
The red line is LOWESS polynomial regression.

We combine the previously defined EW measure 
with collocation index, by defining measure EWC 
(= "ESA + WordNet + collocations") as follows: 



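$$\mu_{\mathrm{EWC}}(w_1, w_2) = \mu_{\mathrm{ESA}}(w_1, w_2) \cdot \bigl(1 + \lambda\,\sigma_{m,s}(\mu_{\mathrm{WNP}}(w_1, w_2))\bigr) \cdot \bigl(1 + \lambda'\,\sigma_{m',s'}(C_\xi(w_1, w_2))\bigr),$$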



where λ, m, s are as in (1), λ', m' and s' are similar,
and the mixed collocation index $C_\xi$ is defined as follows:

$$C_\xi(w_1, w_2) = \frac{2\,\#(w_1 w_2)}{\#(w_1) + \#(w_2)} + \xi\,\frac{2\,\#(w_2 w_1)}{\#(w_1) + \#(w_2)},$$



where #(.) is the frequency in the corpus. 
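A minimal sketch of the mixed collocation index and of the EWC combination follows. The corpus counts are invented, the logistic form of the sigmoid is an assumption (the text only names its inflection point and steepness), and the function and parameter names are hypothetical; the default parameter values are the optimal ones reported in this section.

```python
import math

def sigmoid(x, m, s):
    """Logistic curve with inflection point m and steepness s (assumed form)."""
    return 1.0 / (1.0 + math.exp(-(x - m) / s))

def mixed_collocation_index(n_w1w2, n_w2w1, n_w1, n_w2, xi=0.55):
    """Direct collocation index plus xi times the inverse one.

    n_w1w2 and n_w2w1 are bigram frequencies in each order,
    n_w1 and n_w2 the frequencies of the individual words.
    """
    total = n_w1 + n_w2
    return 2 * n_w1w2 / total + xi * 2 * n_w2w1 / total

def mu_ewc(esa, wnp, colloc,
           lam=5.16, m=0.25, s=0.05,
           lam2=48.7, m2=0.19, s2=0.05):
    """EWC measure: ESA times a WordNet factor times a collocation factor."""
    wordnet_factor = 1.0 + lam * sigmoid(wnp, m, s)
    collocation_factor = 1.0 + lam2 * sigmoid(colloc, m2, s2)
    return esa * wordnet_factor * collocation_factor

# Invented counts: "w1 w2" occurs 50 times, "w2 w1" 10 times,
# and each word occurs 1000 times on its own.
c_xi = mixed_collocation_index(50, 10, 1000, 1000)
print(c_xi)
print(mu_ewc(0.2, 0.0, c_xi))
```

Setting ξ below 1 makes the index asymmetric in favor of the written order of the pair, which is the effect interpreted at the end of this section.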

Calculations give the following optimal result: 



| λ = 5.16, m = 0.25, λ' = 48.7, m' = 0.19, s = s' = 0.05, ξ = 0.55 | ρ = 0.7874 |



which is 1.2% higher than EW and, to the best of our 
knowledge, currently the highest result for WS-353 by 
a direct measure (not using a support vector machine). 
The parameter values have been obtained by gradient 
descent. 

We can interpret this result as follows: the EWC 
measure works best when the lower fourth of WordNet 
measure and the lower fifth of collocation index values 
are ignored, and when inverse collocations count half 
as much as direct ones. 

5 Supervised Approach Using an SVM 

(Agirre et al., 2009, p. 25) train an SVM on pairs of
WS-353 pairs; this allows them to gain insight into the
performance increase obtained by combining various
measures. By combining knowledge from a Web corpus
and from WordNet, they obtain a highest value of
Spearman ρ = 0.78. We calculated predictions of (4th
degree polynomial) SVMs based on our EW and EWC
measures, and obtained the following results, using
10-fold cross-validation:

| Measure                         | Result     |
| EW (ESA + WNP)                  | ρ = 0.7996 |
| EWC (ESA + WNP + collocations)  | ρ = 0.8654 |



We observe that even without collocations we al- 
ready get a better value than (Agirre et al., 2009), and 
also that the collocation component increases this value 
significantly, hence validating our choice of using col- 
locational knowledge to enhance semantic relatedness 
measurement. 

6 Pragmatic Knowledge 

This class contains pairs not captured by the previous 
methods. The typical example is "hotel / reservation": 
its ESA value is very low, there is no ontological re- 
lation, and the collocation index is quite low as well. 
To capture the relatedness of such a pair, we need spe- 
cific knowledge domain ontologies, providing relations 
such as "A is part of a functional process of system B" 
(in this case: "a 'reservation' is part of the process of 
renting a room in a 'hotel' "). We leave this as an open 
task for future development. 

7 Conclusion 

By combining two pre-existing semantic relatedness 
measures and by adding a component based on fre- 
quency of collocations, we have obtained a new mea- 
sure that surpasses the one given in (Agirre et al., 2009) 
by 11% (when comparing results obtained by SVMs). 
We conjecture that this measure can further be en- 
hanced by using pragmatic knowledge taken, for ex- 
ample, from specialized domain ontologies. 

Appendices 

A Adapting ESA to 2011 Wikipedia 

The original (and unreleased) C++ ESA implementation
(Gabrilovich and Markovitch, 2007) is based on
2005 Wikipedia data (2.2 GB) and achieves a Spearman
ρ = 0.75. A later implementation in Python and
Java (Calh, 2010), based on the same corpus, achieves
ρ = 0.74. We implemented ESA in Perl and similarly
obtained ρ = 0.7404 when based on 2005 data. The
same algorithms applied to 2011 data (31 GB) produced
a disappointing ρ = 0.7047. Indeed, between
2005 and 2011, Wikipedia has evolved as follows:



|                | 2005    | 2011      |
| #concepts      | 866,881 | 4,178,454 |
| #terms/concept | 96.1971 | 97.4243   |



where by "concepts" we mean Wikipedia pages in the 
main namespace, and by "terms," distinct stemmed 
words. 

Following advice by Gabrilovich (personal communication),
we increased the generality of concepts by
filtering Wikipedia pages by two criteria: minimum
number of terms, and minimum number of in- and outgoing
links. The original values were: 100 terms and
5 links; by requiring a minimum of 200 terms and 14
links, we have attained the 2005 ρ value (more precisely:
ρ = 0.7394). Fig. 6 displays ρ as a function
of our two criteria.




Figure 6: Adapting ESA to 2011 Wikipedia data by increasing
the minimum number of distinct (stemmed) words and of
in- and outgoing links per page. (3D plot; axes: min # of words
per page, min # of links per page, Spearman ρ.)

In the following table, the column 2011 displays the
results with the original ESA settings, 2011* the ones with
modified settings, df is the mean document frequency of
terms, and term density is df / #concepts.

|                | 2005     | 2011     | 2011*    |
| #concepts      | 132,689  | 311,209  | 155,767  |
| #terms/concept | 165      | 279      | 414      |
| #terms         | 187,971  | 503,368  | 408,299  |
| df             | 116.3307 | 173.7199 | 159.0395 |
| term density   | 0.00088  | 0.00056  | 0.00102  |



As we see, terms are less densely distributed in the
2011 corpus, since the increase of their mean document
frequency, though substantial, is outweighed by an even
larger increase in the number of concepts. By
pruning concepts more efficiently and leaving df relatively
stable, we manage to increase term density again
and hence enhance performance.
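The term density figures can be checked directly from the reported #concepts and df values, since term density is defined as df divided by #concepts. The short script below only redoes that arithmetic; the dictionary layout is our own.

```python
# (#concepts, mean document frequency df) as reported for each setting.
stats = {
    "2005": (132689, 116.3307),
    "2011": (311209, 173.7199),
    "2011*": (155767, 159.0395),
}

# Term density = df / #concepts.
density = {name: df / n_concepts for name, (n_concepts, df) in stats.items()}

for name, value in density.items():
    print(f"{name}: {value:.5f}")
```

The modified settings (2011*) roughly halve the number of concepts relative to the unfiltered 2011 run while df drops only slightly, which is why the density ends up above the 2005 value.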

B Experiments 

(Giraud-Carrier and Dunham, 2010) emphasize the importance
of sharing negative results. Responding to
their call, here are some of our failed attempts at increasing
ESA performance on the 2005 corpus. Note
that the standard ESA value we challenge is ρ =
0.7404.

B.1 At the Word Level: Lemmatization and POS
Filtering

ESA removes stop words and words with fewer than 
three letters before applying the Porter stemmer thrice. 
Instead of stemming, we lemmatized and then applied 
two strategies: keeping only nouns and proper names 
(Penn tags NN, NNP, and plurals), or also verbs and 
adjectives (tags starting with NN, NNP, VB, and JJ). 
Here are the results obtained: 



| Penn tags NN, NNS, NNP, NNPS  | ρ = 0.7194 |
| Penn tags NN*, NNP*, VB*, JJ* | ρ = 0.7178 |



The performance loss is due to lemmatization, suggesting
once again that Porter stemming, although it may seem a
brutal technique, works better than the alternatives we tried. Note
that, surprisingly, when adding verbs and adjectives we
get a (slightly) smaller ρ.

B.2 Filtering at the Sentence Level 

We attempted to triple the weight of sentences containing
either the page title, or one of the (non-stop) words
of the page title, or one of the anchors pointing
to the page. This operation affected 1,399,165 sentences.
Here are the results obtained:

| Tripling weight of selected sentences | ρ = 0.7293 |

B.3 Filtering at the Section Level 

The idea is to avoid "historical sections" in pages describing
current notions or objects. Historical sections
are detected by a higher frequency of past-tense verbs,
unless of course the whole page is of a historical nature,
and hence uses primarily the past tense. Let
π = #past-tense verbs / #verbs for each Wikipedia page. We pruned
sections with π > 0.8 when the page had π < 0.8.
We also pruned sections named "History," "External
links," "References," "See also," "Further reading," and
"Bibliography." This affected 111,028 sections out of
470,948. Here are the results obtained:

| Pruning of "historical" and other sections | ρ = 0.6608 |
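The pruning rule can be sketched as below. This is only an illustration under stated assumptions: the ratio is taken to be past-tense verbs over all verbs, the page-level ratio is computed by summing the section counts, and the verb counts are assumed to come from a POS tagger run beforehand. The function names and the example sections are hypothetical.

```python
SKIPPED_TITLES = {
    "History", "External links", "References",
    "See also", "Further reading", "Bibliography",
}

def past_ratio(past_tense_verbs, verbs):
    """Share of past-tense verbs among all verbs (0 if there are no verbs)."""
    return past_tense_verbs / verbs if verbs else 0.0

def prune_sections(sections, threshold=0.8):
    """Return the titles of the sections that are kept.

    sections: list of (title, past_tense_verbs, verbs) tuples.
    A section is dropped if its title is in SKIPPED_TITLES, or if its
    past-tense ratio exceeds the threshold while the page's does not.
    """
    page_past = sum(p for _, p, _ in sections)
    page_verbs = sum(v for _, _, v in sections)
    page_is_historical = past_ratio(page_past, page_verbs) >= threshold
    kept = []
    for title, past, verbs in sections:
        if title in SKIPPED_TITLES:
            continue
        if not page_is_historical and past_ratio(past, verbs) > threshold:
            continue
        kept.append(title)
    return kept

# Invented page about a current object with one "historical" section.
modern_page = [("Overview", 2, 20), ("Origins", 18, 20),
               ("Usage", 3, 30), ("References", 0, 0)]
# Invented page that is historical as a whole: nothing is pruned by tense.
historical_page = [("Early life", 19, 20), ("Reign", 18, 20)]

print(prune_sections(modern_page))
print(prune_sections(historical_page))
```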

C Implementation Details 

Implementation of ESA was done from scratch in
Lex and Perl. To access WordNet v3, we used
the Perl module WordNet::Similarity (Pedersen
et al., 2004). SVM calculations as well as 2D
figures were done in R, and the 3D figure in Matlab.
For lemmatizing and POS-tagging, we used TreeTagger
(Schmid, 1994). Our code is publicly available
at http://omega2.enstb.org/yannis/similarity.php.